Growing distributed networks with arbitrary degree distributions 
We consider distributed networks, such as peer-to-peer networks, whose structure can be
manipulated by adjusting the rules by which vertices enter and leave the network. We focus in particular
on degree distributions and show that, with some mild constraints, it is possible by a suitable choice 
of rules to arrange for the network to have any degree distribution we desire. We also describe 
a mechanism based on biased random walks by which appropriate rules could be implemented in 
practice. As an example application, we describe and simulate the construction of a peer-to-peer 
network optimized to minimize search times and bandwidth requirements. 



I. INTRODUCTION 

Complex networks, such as the Internet, the worldwide 
web, and social and biological networks, have attracted 
a remarkable amount of attention from the physics community
in recent years [1, 2, 3, 4]. Most studies of these
systems have concentrated on determining their struc- 
ture or the effects that structure has on the behavior of 
the system. For instance, a considerable amount of effort 
has been devoted to studies of the degree distributions 
of networks, their measurement and the formulation of 
theories to explain how they come to take the observed 
forms, and models of the effect of particular degree dis- 
tributions on dynamical processes on networks, network 
resilience, percolation properties, and many other phe- 
nomena. Such studies are appropriate for "naturally oc- 
curring" networks, whose structure grows or is created 
according to some set of rules not under our direct con- 
trol. The Internet, the web, and social networks fall into 
this category even though they are man-made, since their 
growth is distributed and not under the control of any 
single authority. 

Not all networks fall in this class however. There are 
some networks whose structure is centrally controlled, 
such as telephone networks, some transportation net- 
works, or distribution networks like power grids. For 
these networks it is interesting to ask how, if one can 
design the network to have any structure one pleases, 
one could choose that structure to optimize some desired 
property of the network. For instance, Paul et al. [5] have
considered how the structure of a network should be cho- 
sen to optimize the network's robustness to deletion of its 
vertices. 

In this paper we study a class of networks that falls 
between these two types. There are some networks that 
grow in a collaborative, distributed fashion, so that we 
cannot control the network's structure directly. But we 
can control some of the rules by which the network forms 
and this in turn allows us a limited degree of influence 
over the structure. The archetypal example of such a 
system is a distributed database such as a peer-to-peer 
filesharing network, which is a virtual network of linked 
computers that share data among themselves. The network
is formed by a dynamical process under which individual
computers continually join or leave the network,
and the rules of joining and leaving can be manipulated
to some extent by changing the behavior of the software
governing the computers. It is well established
that the structure of peer-to-peer networks can have a
strong effect on their performance [6, 7], but to a large
extent that structure has in the past been regarded as
an experimentally determined quantity [8]. Here we consider
ways in which the structure can be manipulated by
changing the behavior of individual nodes so as to opti- 
mize network performance. 



II. GROWING NETWORKS WITH DESIRED 
PROPERTIES 

In this paper we focus primarily on creating networks 
with desired degree distributions: the degree distribu- 
tion typically has a strong effect on the behavior of the 
network and is relatively straightforward to treat math- 
ematically. There are two basic problems we need to 
address if we want to create a network with a specific 
degree distribution solely by manipulating the rules by 
which vertices enter and leave the network. First, we 
need to find rules that will achieve the desired result, 
and second, we need to find a practical mechanism that 
implements those rules and operates in reasonable time. 
We deal with these questions in order. 

Our approach to growing a network with a desired de- 
gree distribution is based on the idea of the "attachment 
kernel" introduced by Krapivsky and Redner [9]. We assume
that vertices join our network at intervals and that
when they do so they form connections — edges — to some 
number of other vertices in the network. By designing 
the software appropriately, we can in a peer-to-peer net- 
work choose the number of edges a newly joining vertex 
makes and also, as we will shortly show, some crucial 
aspects of which other vertices those edges connect to. 

Let us define $p_k$ to be the degree distribution of our
network at some time, i.e., the fraction of vertices having
degree $k$, which satisfies the normalization condition

$$\sum_{k=0}^{\infty} p_k = 1. \tag{1}$$



And let us define the attachment kernel $\pi_k$ to be $n$ times
the probability that an edge from a newly appearing vertex
connects to a particular preexisting vertex of degree $k$,
where $n$ is the number of vertices in the network. It
is this attachment kernel that we will manipulate to produce
a desired degree distribution. The extra factor of $n$
in the definition is not strictly necessary, but it is convenient:
since the total number of vertices of degree $k$ is
$np_k$, it means that the probability of a new edge connecting
to any vertex of degree $k$ is just $\pi_k p_k$, and hence $\pi_k$
satisfies the normalization condition

$$\sum_{k=0}^{\infty} \pi_k p_k = 1. \tag{2}$$



In a peer-to-peer network users may exit the network 
whenever they want and we as designers have little con- 
trol over this aspect of the network dynamics. We will 
assume in the calculations that follow that vertices sim- 
ply vanish at random. We will also assume that, on the 
typical time-scales over which people enter and leave the 
network, the total size n of the network does not change 
substantially, so that the rates at which vertices enter and 
leave are roughly equal. For simplicity let us say that ex- 
actly one vertex enters the network and one leaves per 
unit time (although the results presented here are in fact 
still valid even if only the probabilities per unit time of 
addition and deletion of vertices are equal and not the 
rates) . 

Now let us choose the initial degrees of vertices when
they join the network, i.e., the number of connections
that they form upon entering, at random from some distribution
$r_k$. Building on our previous results in [10],
we observe that the evolution of the degree distribution
of our network can be described by a rate equation as
follows. The number of vertices with degree $k$ at a particular
time is $np_k$. One unit of time later we have added
one vertex and taken away one vertex, so that the number
with degree $k$ becomes

$$np'_k = np_k + c\pi_{k-1}p_{k-1} - c\pi_k p_k + (k+1)p_{k+1} - kp_k - p_k + r_k, \tag{3}$$

with the convention that $p_{-1} = 0$, and $c = \sum_{k=0}^{\infty} k r_k$,
which is the average degree of vertices added to the network.
The terms $c\pi_{k-1}p_{k-1}$ and $-c\pi_k p_k$ in Eq. (3) represent
the flow of vertices with degree $k-1$ to $k$ and $k$ to
$k+1$, as they gain extra edges with the addition of new
vertices. The terms $(k+1)p_{k+1}$ and $-kp_k$ represent the
flow of vertices with degree $k+1$ and $k$ to $k$ and $k-1$, as
they lose edges with the removal of neighboring vertices.
The term $-p_k$ represents the probability of removal of a
vertex of degree $k$ and the term $r_k$ represents the addition
of a new vertex with degree $k$ to the network.
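The bookkeeping in Eq. (3) can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `rate_equation_step` and the list-based representation are assumptions, and the degree range is truncated at the length of the input lists, which is only safe when no vertex can be pushed above the largest listed degree.

```python
def rate_equation_step(p, pi, r, n):
    """Apply Eq. (3) once and return the updated degree distribution.

    p[k]  : current fraction of vertices with degree k
    pi[k] : attachment kernel for degree k
    r[k]  : distribution of the initial degrees of added vertices
    n     : number of vertices in the network
    Assumption of this sketch: the three lists have the same length K+1
    and pi[K] * p[K] == 0, so nothing flows above degree K.
    """
    c = sum(k * rk for k, rk in enumerate(r))  # mean degree of added vertices
    K = len(p) - 1
    new_p = []
    for k in range(K + 1):
        gain_in = c * pi[k - 1] * p[k - 1] if k > 0 else 0.0  # k-1 -> k
        gain_out = c * pi[k] * p[k]                            # k -> k+1
        loss_in = (k + 1) * p[k + 1] if k < K else 0.0         # k+1 -> k
        loss_out = k * p[k]                                    # k -> k-1
        delta = gain_in - gain_out + loss_in - loss_out - p[k] + r[k]
        new_p.append(p[k] + delta / n)
    return new_p
```

Because every gain term is matched by a loss term and one vertex is added for each one removed, the updated list still sums to one when the kernel obeys Eq. (2).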



Assuming $p_k$ has an asymptotic form in the limit of
large time, that form is given by setting $p'_k = p_k$ thus:

$$c\pi_{k-1}p_{k-1} - c\pi_k p_k + (k+1)p_{k+1} - kp_k - p_k + r_k = 0. \tag{4}$$

Following previous convention [11], let us define a generating
function $G_0(z)$ for the degree distribution thus:

$$G_0(z) = \sum_{k=0}^{\infty} p_k z^k, \tag{5}$$

as well as generating functions for the degrees of vertices
added and for the attachment kernel thus:

$$F(z) = \sum_{k=0}^{\infty} r_k z^k, \tag{6}$$

$$H(z) = \sum_{k=0}^{\infty} \pi_k p_k z^k. \tag{7}$$



Multiplying both sides of (4) by $z^k$ and summing over $k$,
we then find that the generating functions satisfy the
differential equation

$$(1 - z)\frac{dG_0}{dz} - G_0(z) - c(1 - z)H(z) + F(z) = 0. \tag{8}$$

We are interested in creating a network with a given
degree distribution, i.e., with a given $G_0(z)$. Rearranging
(8), we find that the choice of attachment kernel $\pi_k$
that achieves this is such that

$$H(z) = \frac{1}{c}\left[\frac{dG_0}{dz} + \frac{F(z) - G_0(z)}{1 - z}\right]. \tag{9}$$



Taking the limit $z \to 1$, noting that normalization requires
that all the generating functions tend to 1 at $z = 1$,
and applying L'Hôpital's rule, we find

$$1 = \frac{1}{c}\left[\langle k\rangle + \langle k\rangle - c\right], \tag{10}$$

where we have made use of the fact that the average
degree in the network is $\langle k\rangle = G_0'(1)$ and $c = F'(1)$.
Rearranging, we then find that $c = \langle k\rangle$. In other words,
solutions to Eq. (8) require that the average degree $c$ of
vertices added to the network be equal to the average
degree of vertices in the network as a whole. Making use
of this result, we can write Eq. (9) in the form



$$H(z) = G_1(z) + \frac{F(z) - G_0(z)}{c(1 - z)}, \tag{11}$$

where $G_1(z) = G_0'(z)/G_0'(1) = \sum_k q_k z^k$ is the generating
function for the so-called excess degree distribution

$$q_k = \frac{(k+1)p_{k+1}}{\langle k\rangle}, \tag{12}$$

which appears in many other network-related
calculations; see, for instance, Ref. [11].
Now it is straightforward to derive the desired attachment
kernel. Noting that

$$\frac{1}{1-z} = \sum_{k=0}^{\infty} z^k, \tag{13}$$

we can simply read off the coefficient of $z^k$ on either side
of Eq. (11), to give

$$\pi_k p_k = q_k + \frac{1}{c}\sum_{m=0}^{k}(r_m - p_m), \tag{14}$$

or equivalently

$$\pi_k = \frac{1}{cp_k}\left[(k+1)p_{k+1} + P_{k+1} - R_{k+1}\right], \tag{15}$$

where $P_k$ is the cumulative distribution of vertex degrees
and $R_k$ is the cumulative distribution of added degrees:

$$P_k = \sum_{m=k}^{\infty} p_m, \qquad R_k = \sum_{m=k}^{\infty} r_m. \tag{16}$$



Since we are at liberty to choose both $r_k$ and $\pi_k$, we
have many options for satisfying Eq. (15): given (almost)
any choice of the distribution $r_k$ of the degrees of added
vertices, we can find the corresponding $\pi_k$ that will give
the desired final degree distribution of the network. One
simple choice would be to make the degree distribution
of the added vertices the same as the desired degree distribution,
so that $R_k = P_k$. Then

$$\pi_k = \frac{q_k}{p_k} = \frac{(k+1)p_{k+1}}{cp_k}. \tag{17}$$



In other words, if we have some desired degree distribution
$p_k$ for our network, one way to achieve it is to add
vertices with exactly that degree distribution and then
arrange the attachment process so that the degree distribution
remains preserved thereafter, even as vertices
and edges are added to and removed from the network.
Equation (17) tells us the choice of attachment kernel
that will achieve this. Equation (17) will work for essentially
any choice of degree distribution $p_k$, except choices
for which $p_k = 0$ and $p_{k+1} > 0$ for some $k$. In the latter
case Eq. (17) will diverge for some value(s) of $k$.
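As a concrete check of Eq. (15), the following Python sketch computes the attachment kernel for a target distribution with finite support. It is illustrative only: the function name `attachment_kernel` is an assumption, and for simplicity the sketch rejects any target with a zero entry rather than handling the special cases discussed above. Note that for some choices of $r_k$ the formula can return negative values, in which case no valid kernel exists for that choice.

```python
def attachment_kernel(p, r):
    """Return the list pi[k] given by Eq. (15).

    p : target degree distribution (list, p[k] > 0 for every listed k)
    r : degree distribution of added vertices (same length as p)
    The two distributions must share the same mean, c = <k>.
    """
    c = sum(k * rk for k, rk in enumerate(r))
    mean_p = sum(k * pk for k, pk in enumerate(p))
    if abs(c - mean_p) > 1e-9:
        raise ValueError("added vertices must have mean degree <k>")
    K = len(p) - 1
    pi = []
    for k in range(K + 1):
        if p[k] == 0:
            raise ValueError("this sketch requires p[k] > 0 for all listed k")
        p_next = p[k + 1] if k < K else 0.0
        P_tail = sum(p[k + 1:])  # cumulative P_{k+1}
        R_tail = sum(r[k + 1:])  # cumulative R_{k+1}
        pi.append(((k + 1) * p_next + P_tail - R_tail) / (c * p[k]))
    return pi
```

Passing `r = p` reproduces the simple choice of Eq. (17), and in either case the result should satisfy the normalization condition of Eq. (2).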



A. Example: power-law degree distribution 

As an example, consider the creation of a network with 
a power-law degree distribution. Adamic et al. [6] have
shown that search processes on peer-to-peer networks 
with power-law degree distributions are particularly effi- 
cient, so there are reasons why one might want to gener- 
ate such a network. 

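Let us choose

$$p_k = \begin{cases} Ck^{-\gamma} & \text{for } k \ge 1, \\ p_0 & \text{for } k = 0, \end{cases} \tag{18}$$

where $\gamma$ and $p_0$ are constants and the normalizing factor
$C$ is given by

$$C = \frac{1 - p_0}{\zeta(\gamma)}, \tag{19}$$

where $\zeta(\gamma)$ is the Riemann zeta-function. Then the mean
degree is

$$\langle k\rangle = c = (1 - p_0)\frac{\zeta(\gamma-1)}{\zeta(\gamma)}, \tag{20}$$

and Eq. (17) tells us that the correct choice of attachment
kernel in this case is

$$\pi_k = \frac{\zeta(\gamma)}{(1-p_0)\,\zeta(\gamma-1)}\,\frac{k^{\gamma}}{(k+1)^{\gamma-1}} \tag{21}$$

for $k \ge 1$ and

$$\pi_0 = \frac{1}{p_0\,\zeta(\gamma-1)}. \tag{22}$$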



It is interesting to note that as k becomes large, this 
attachment kernel goes as $\pi_k \sim k$, the so-called (linear)
preferential attachment form in which vertices connect to 
others in simple proportion to their current degree. In 
growing networks this form is known to give rise, asymp- 
totically, to a power-law degree distribution. It is impor- 
tant to understand, however, that in the present case the 
network is not growing and hence, despite the apparent 
similarity, this is not the same result. Indeed, it is known 
that for non-growing networks, purely linear preferential 
attachment does not produce power-law degree distributions
[12, 13], but instead generates stretched exponential
distributions [10]. Thus it is somewhat surprising
to observe that one can, nonetheless, create a power-law 
degree distribution in a non-growing network using an 
attachment kernel that seems, superficially, quite close 
to the linear form. 
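How close the kernel is to the linear form can be made explicit by expanding Eq. (21) for large $k$. The following short expansion is a worked step added for illustration, using only the binomial series for $(1 + 1/k)^{1-\gamma}$:

```latex
\pi_k = \frac{\zeta(\gamma)}{(1-p_0)\,\zeta(\gamma-1)}\; k\left(1+\frac{1}{k}\right)^{1-\gamma}
      = \frac{\zeta(\gamma)}{(1-p_0)\,\zeta(\gamma-1)}
        \left[\,k - (\gamma-1) + \frac{\gamma(\gamma-1)}{2k} + O\!\left(k^{-2}\right)\right].
```

To leading order the kernel is linear in $k$; the departure from pure linear preferential attachment is a constant shift of $-(\gamma - 1)$ plus terms that vanish as $k$ grows.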

Sarshar and Roychowdhury [12] showed previously
that it is possible to generate a non-growing power-law 
network by using linear preferential attachment and then 
compensating for the expected loss of power-law behavior 
by rewiring the connections of some vertices after their 
addition to the network. Our results indicate that, al- 
though this process will certainly work, it is not neces- 
sary: a slight modification to the preferential attachment 
process will achieve the same goal and frees us from the 
need to rewire any edges. 

Note also that (21) is not the only solution of Eq. (15)
that will generate a power-law distribution. If we choose
a different (e.g., non-power-law) distribution for the vertices
added to the network, we can still generate an overall
power-law distribution by choosing the attachment
kernel to satisfy Eq. (15). Suppose, for instance, that,
rather than adding vertices with a power-law degree distribution,
we prefer to give them a Poisson distribution
with mean $c$:

$$r_k = \frac{e^{-c}c^k}{k!}. \tag{23}$$



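In this case $R_k = 1 - \Gamma(k, c)/\Gamma(k)$, where $\Gamma(k)$ is the
standard gamma function and $\Gamma(k, c)$ is the incomplete
gamma function. Then the power law is correctly generated
by the choice

$$\pi_k = \frac{\zeta(\gamma)}{(1-p_0)\,\zeta(\gamma-1)}\,k^{\gamma}\left[(k+1)^{-\gamma+1} + \zeta(\gamma, k+1) + \frac{\zeta(\gamma)}{1-p_0}\left(\frac{\Gamma(k+1, c)}{\Gamma(k+1)} - 1\right)\right] \tag{24}$$

for $k \ge 1$, where $\zeta(\gamma, x)$ is the generalized zeta function
$\zeta(\gamma, x) = \sum_{k=0}^{\infty}(k + x)^{-\gamma}$. For $k = 0$,

$$\pi_0 = \frac{1}{p_0\,\zeta(\gamma-1)}\left[1 + \frac{\zeta(\gamma)}{1-p_0}\left(e^{-c} - p_0\right)\right]. \tag{25}$$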



III. A PRACTICAL IMPLEMENTATION 

In theory, we should be able to use the ideas of the previous
section to grow a network with a desired degree
distribution. This does not, however, yet mean we can 
do so in practice. To make our scheme a practical real- 
ity, we still need to devise a realistic way to place edges 
between vertices with the desired attachment kernel $\pi_k$.
If each vertex entering the network knew the identities 
and degrees of all other vertices, this would be easy: we 
would simply select a degree k at random in proportion 
to $\pi_k p_k$, and then attach our new edge to a vertex chosen
uniformly at random from those having that degree. 

In the real world, however, and particularly in peer-to- 
peer networks, no vertex "knows" the identity of all oth- 
ers. Typically, computers only know the identities (such 
as IP addresses) of their immediate network neighbors. 
To get around this problem, we propose the following 
scheme, which makes use of biased random walks. 

A random walk, in this context, is a succession of steps 
along edges in our network where at each vertex i we 
choose to step next to a vertex chosen at random from 
the set of neighbors of i. In the context of a peer-to- 
peer computer network, for example, such a walk can 
be implemented by message passing between peers. The 
"walker" is a message or data packet that is passed from 
computer to neighboring computer, with each computer 
making random choices about which neighbor to pass to 
next. 

Starting a walk from any vertex in the network, we 
can sample vertices by allowing the walk to take some 
fixed number of steps and then choosing the vertex that 
it lands upon on its final step. We will consider random 
walks in which the choice of which step to make at each 
vertex is deliberately biased to create a desired probabil- 
ity distribution for the sample as follows. 

Consider a walk in which a walker at vertex $j$ chooses
uniformly at random one of the $k_j$ neighbors of that vertex.
Let us call this neighbor $i$. Then the walk takes
a step to vertex $i$ with some acceptance probability $P_{ij}$.
The total probability $T_{ij}$ of a transition from $j$ to $i$ given
that we are currently at $j$ is

$$T_{ij} = \frac{A_{ij}}{k_j}P_{ij}, \tag{26}$$

where $k_j$ is the degree of vertex $j$ and $A_{ij}$ is an element
of the adjacency matrix:

$$A_{ij} = \begin{cases} 1 & \text{if there is an edge joining vertices } i, j, \\ 0 & \text{otherwise.} \end{cases} \tag{27}$$

If the step is not accepted, then the random walker re- 
mains at vertex j for the current step. 

This random walk constitutes an ordinary Markov process,
which converges to a distribution $p_i$ over vertices
provided the network is connected (i.e., consists of a single
component) and provided $T_{ij}$ satisfies the detailed
balance condition

$$T_{ij}p_j = T_{ji}p_i. \tag{28}$$

In the present case we wish to select vertices in proportion
to the attachment kernel $\pi_{k_i}$. Setting $p_i = \pi_{k_i}$,
this implies that $T_{ij}$ should satisfy

$$\frac{T_{ij}}{T_{ji}} = \frac{p_i}{p_j} = \frac{\pi_{k_i}}{\pi_{k_j}}. \tag{29}$$



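Or, making use of Eqs. (17) and (26) for the case where
$r_k = p_k$, we find

$$\frac{P_{ij}}{P_{ji}} = \frac{(k_i+1)p_{k_i+1}}{k_i p_{k_i}} \times \frac{k_j p_{k_j}}{(k_j+1)p_{k_j+1}} = \frac{q_{k_i}q_{k_j-1}}{q_{k_j}q_{k_i-1}}, \tag{30}$$

where $q_k$ is again the excess degree distribution, Eq. (12).

In practice, we can satisfy this equation by making the
standard Metropolis-Hastings choice for the acceptance
probability:

$$P_{ij} = \begin{cases} \dfrac{q_{k_i}q_{k_j-1}}{q_{k_j}q_{k_i-1}} & \text{if } q_{k_i}q_{k_j-1} < q_{k_j}q_{k_i-1}, \\ 1 & \text{otherwise.} \end{cases} \tag{31}$$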

Thus the calculation of the acceptance probability re- 
quires only that each vertex know the degrees of its neigh- 
boring vertices, which can be established by a brief ex- 
change of data when the need arises. 

As an example, suppose we wish to generate a network
with a Poisson degree distribution

$$p_k = \frac{e^{-\mu}\mu^k}{k!}, \tag{32}$$

where $\mu$ is the mean of the Poisson distribution. Then
we find that the appropriate choice of acceptance ratio is

$$P_{ij} = \begin{cases} k_j/k_i & \text{if } k_i > k_j, \\ 1 & \text{otherwise.} \end{cases} \tag{33}$$

(As discussed above, we must also make sure to choose
the mean degree $c$ of vertices added to the network to be
equal to $\mu$.)
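The reduction from Eq. (31) to Eq. (33) can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names `excess_degree` and `acceptance` are hypothetical, the target distribution is supplied as a truncated list, and both degrees are assumed to be at least 1 and to lie where the excess degree distribution is nonzero.

```python
import math

def excess_degree(p):
    """Excess degree distribution q[k] = (k+1) p[k+1] / <k>, Eq. (12).

    The returned list is one entry shorter than p.
    """
    mean = sum(k * pk for k, pk in enumerate(p))
    return [(k + 1) * p[k + 1] / mean for k in range(len(p) - 1)]

def acceptance(q, k_i, k_j):
    """Metropolis-Hastings acceptance probability of Eq. (31).

    k_j is the degree of the current vertex j and k_i the degree of the
    proposed neighbor i; both must be >= 1 since the vertices share an edge.
    """
    ratio = (q[k_i] * q[k_j - 1]) / (q[k_j] * q[k_i - 1])
    return min(1.0, ratio)

# Poisson target with mean mu, truncated at k = 40 for this illustration.
mu = 3.0
p = [math.exp(-mu) * mu**k / math.factorial(k) for k in range(41)]
q = excess_degree(p)
print(acceptance(q, 5, 2), acceptance(q, 2, 5))
```

For the Poisson case the ratio in Eq. (31) collapses to $k_j/k_i$, so a step toward a higher-degree neighbor is accepted with probability $k_j/k_i$ and a step toward a lower-degree neighbor is always accepted.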
Our proposed method for creating a network is thus
as follows. Each newly joining vertex $i$ first chooses a
degree $k$ for itself, which is drawn from the desired distribution
$p_k$. It must also locate one single other vertex $j$
in the network. It might do this for instance using a list
of known previous members of the network or a standardized
list of permanent network members. Vertex $j$
is probably not selected randomly from the network, so
it is not chosen as a neighbor of $i$. Instead, we use it as
the starting point for a set of $k$ biased random walkers
of the type described above. Each walker consists of a
message, which starts at $j$ and propagates through the
network by being passed from computer to neighboring
computer. The message contains (at a minimum) the
address of the computer at vertex $i$ as well as a counter
that is updated by each computer to record the number
of steps the walker has taken. (Bear in mind that steps
on which the walker doesn't move, because the proposed
move was rejected, are still counted as steps.) The computer
that the walker reaches on its $t$th step, where $t$ is a
fixed but generous constant chosen to allow enough time
for mixing of the walk, establishes a new network edge
between itself and vertex $i$ and the walker is then deleted.
When all $k$ walkers have terminated in this way, vertex $i$
has $k$ new neighbors in the network, chosen in proportion
to the correct attachment kernel $\pi_k$ for the desired
distribution. After a suitable interval of time, this process
will result in a network that has the chosen degree
distribution $p_k$, but is otherwise random.
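The joining procedure described above can be sketched as a small single-process simulation. This is a hedged illustration, not the authors' implementation: the names `biased_walk` and `join`, the dictionary-of-lists graph, and the `accept(k_i, k_j)` callback are all assumptions, and in a real peer-to-peer system each step would be a message passed between computers rather than a loop iteration.

```python
import random

def biased_walk(adj, start, steps, accept, rng=random):
    """Run one biased walker for a fixed number of steps; return its final vertex.

    adj    : dict mapping each vertex to a list of its neighbors
    accept : function (k_i, k_j) giving the probability of stepping from a
             vertex of degree k_j to a proposed neighbor of degree k_i
    """
    j = start
    for _ in range(steps):              # rejected proposals still count as steps
        i = rng.choice(adj[j])          # propose a uniformly random neighbor
        if rng.random() < accept(len(adj[i]), len(adj[j])):
            j = i
    return j

def join(adj, new_vertex, degree, entry_point, steps, accept, rng=random):
    """Attach new_vertex using `degree` walkers started at entry_point.

    All walkers finish before any edge is added. Two walkers may end on the
    same vertex; this sketch does not filter such duplicates.
    """
    targets = [biased_walk(adj, entry_point, steps, accept, rng)
               for _ in range(degree)]
    adj[new_vertex] = []
    for t in targets:
        adj[new_vertex].append(t)
        adj[t].append(new_vertex)
    return targets

# Example: Poisson target, for which the acceptance rule is Eq. (33).
poisson_accept = lambda k_i, k_j: min(1.0, k_j / k_i)
network = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
join(network, 5, 2, 1, 50, poisson_accept, random.Random(1))
```

For the Poisson acceptance rule the walk has a uniform stationary distribution, so on the five-vertex star in the example a well-mixed walker should end on the hub about one time in five.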

As a test of this method, we have performed simulations
of the growth of a network with a Poisson degree
distribution as in Eq. (33). Starting from a random
graph of the desired size $n$, we randomly add and remove
vertices according to the prescription given above. Figure 1
shows the resulting degree distribution for the case
$\mu = 10$, along with the expected Poisson distribution.
As the figure shows, the agreement between the two is
excellent.



IV. EXAMPLE APPLICATION 

As an example of the application of these ideas we con- 
sider peer-to-peer networks. Bandwidth restrictions and 
search times place substantial constraints on the perfor- 
mance of peer-to-peer networks, and the methods of the 
previous sections can be used to nudge networks towards 
a structure that improves their performance in these re- 
spects. More sophisticated applications are certainly pos- 
sible, but the one presented here offers an indication of 
the kinds of possibilities open to us. 



A. Definition of the problem 

Consider a distributed database consisting of a set of
computers each of which holds some data items. Copies
of the same item can exist on more than one computer,
which would make searching easier, but we will not assume
this to be the case. Computers are connected together
in a "virtual network," meaning that each computer
is designated as a "neighbor" of some number of
other computers. These connections between computers
are purely notional: every computer can communicate
with every other directly over the Internet or other physical
network. The virtual network is used only to limit
the amount of information that computers have to keep
about their peers.

[Figure 1 plot; horizontal axis: degree $k$.]

FIG. 1: The degree distribution for a network of $n = 50\,000$
vertices generated using the biased random walk mechanism
described in the text with $\mu = 10$. The points represent
the results of our simulations and the solid line is the target
distribution, Eq. (32).

Each computer maintains a directory of the data items 
held by its network neighbors, but not by any other com- 
puters in the network. Searches for items are performed 
by passing a request for a particular item from computer 
to computer until it reaches one in whose directory that 
item appears, meaning that one of that computer's neigh- 
bors holds the item. The identity of the computer hold- 
ing the item is then transmitted back to the origin of the 
search and the origin and target computers communicate 
directly thereafter to negotiate the transfer of the item. 
This basic model is essentially the same as that used by 
other authors [|| as well as by many actual peer-to-peer 
networks in the real world. Note that it achieves effi- 
ciency by the use of relatively large directories at each 
vertex of the network, which inevitably use up mem- 
ory resources on the computers. However, with standard 
hash-coding techniques and for databases of the typical 
sizes encountered in practical situations (thousands or 
millions of items) the amounts of memory involved are 
quite modest by modern standards. 



B. Search time and bandwidth 

The two metrics of search performance that we con- 
sider in this example are search time and bandwidth, both 
of which should be low for a good search algorithm. We
define the search time to be the number of steps taken 
by a propagating search query before the desired target 
item is found. We define the bandwidth for a vertex as 
the average number of queries that pass through that ver- 
tex per unit time. Bandwidth is a measure of the actual 
communications bandwidth that vertices must expend to 
keep the network as a whole running smoothly, but it is 
also a rough measure of the CPU time they must devote 
to searches. Since these are limited resources it is crucial 
that we not allow the bandwidth to grow too quickly as 
vertices are added to the network, otherwise the size of 
the network will be constrained, a severe disadvantage 
for networks that can in some cases swell to encompass a 
significant fraction of all the computers on the planet. (In 
some early peer-to-peer networks, issues such as this did 
indeed place impractical limits on network size [14, 15].)
Assuming that the average behavior of a user of the 
database remains essentially the same as the network 
gets larger, the number of queries launched per unit time 
should increase linearly with the size of the network, 
which in turn suggests that the bandwidth per vertex 
might also increase with network size, which would be 
a bad thing. As we will show, however, it is possible 
to avoid this by designing the topology of the network 
appropriately. 



C. Search strategies and search time 

In order to treat the search problem quantitatively, we 
need to define a search strategy or algorithm. Here we 
consider a very simple — even brainless — strategy, again 
employing the idea of a random walk. This random walk 
search is certainly not the most efficient strategy possi- 
ble, but it has two significant advantages for our pur- 
poses. First, it is simple enough to allow us to carry out 
analytic calculations of its performance. Second, as we 
will show, even this basic strategy can be made to work 
very well. Our results constitute an existence proof that 
good performance is achievable: searches are necessarily 
possible that are at least as good as those analyzed here. 

The definition of our random walk search is simple: the 
vertex i originating a search sends a query for the item it 
wishes to find to one of its neighbors j , chosen uniformly 
at random. If that item exists in the neighbor's directory 
the identity of the computer holding the item is trans- 
mitted to the originating vertex and the search ends. If 
not, then j passes the query to one of its neighbors cho- 
sen at random, and so forth. (One obvious improvement 
to the algorithm already suggests itself: that j not pass 
the query back to i again. As we have said, however, our 
goal is simplicity and we will allow such "backtracking" 
in the interests of simplifying the analysis.) 

We can study the behavior of this random walk search
by a method similar to the one we employed for the analysis
of the biased random walks of Section III. Let $p_i$ be
the probability that our random walker is at vertex $i$ at
a particular time. Then the probability $p'_i$ of its being at
$i$ one step later, assuming the target item has not been
found, is

$$p'_i = \sum_j \frac{A_{ij}}{k_j}p_j, \tag{34}$$

where $k_j$ is the degree of vertex $j$ and $A_{ij}$ is an element of
the adjacency matrix, Eq. (27). Under the same conditions
as before the probability distribution over vertices
then tends to the fixed point of (34), which is at

$$p_i = \frac{k_i}{2m}, \tag{35}$$



where m is the total number of edges in the network. 
That is, the random walk visits vertices with probability 
proportional to their degrees. (An alternative statement 
of the same result is that the random walk visits edges 
uniformly.) 
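The fixed point in Eq. (35) is easy to verify on a small example. The following sketch is illustrative only: the function name `walk_step` and the four-vertex test graph are assumptions chosen for the demonstration.

```python
def walk_step(adj, p):
    """Apply Eq. (34) once: p'_i = sum_j A_ij p_j / k_j.

    adj : dict mapping each vertex to a list of its neighbors
    p   : dict mapping each vertex to the walker's occupation probability
    """
    new_p = {i: 0.0 for i in adj}
    for j, neighbors in adj.items():
        share = p[j] / len(neighbors)   # probability sent along each edge of j
        for i in neighbors:
            new_p[i] += share
    return new_p

# A triangle (0, 1, 2) with one pendant vertex 3 attached to vertex 1.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
two_m = sum(len(nbrs) for nbrs in graph.values())      # 2m = sum of degrees
stationary = {i: len(nbrs) / two_m for i, nbrs in graph.items()}
after_one_step = walk_step(graph, stationary)
```

Starting from $p_i = k_i/2m$, one application of the update returns the same distribution, as Eq. (35) requires.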

When our random walker arrives at a previously unvisited
vertex of degree $k_i$, it "learns" from that vertex's
directory about the items held by all immediate neighbors
of the vertex, of which there are $k_i - 1$ excluding
the vertex we arrived from (whose items by definition
we already know about). Thus at every step the walker
gathers more information about the network. The average
number of vertices it learns about upon making a
single step is $\sum_i p_i(k_i - 1)$, with $p_i$ given by (35), and
hence the total number it learns about after $\tau$ steps is

$$\frac{\tau}{2m}\sum_i k_i(k_i - 1) = \tau\left[\frac{\langle k^2\rangle}{\langle k\rangle} - 1\right], \tag{36}$$

where $\langle k\rangle$ and $\langle k^2\rangle$ represent the mean and mean-square
degrees in the network and we have made use of $2m = n\langle k\rangle$.
(There is in theory a correction to this result because
the random walker is allowed to backtrack and visit
vertices visited previously. For a well-mixed walk, however,
this correction is of order $1/\langle k\rangle$, which, as we will
see, is negligible for the networks we will be considering.)

How long will it take our walker to find the desired 
target item? That depends on how many instances of the 
target exist in the network. In many cases of practical 
interest, copies of items exist on a fixed fraction of the 
vertices in the network, which makes for quite an easy 
search. We will not however assume this to be the case 
here. Instead we will consider the much harder problem 
in which copies of the target item exist on only a fixed 
number of vertices, where that number could potentially 
be just 1. In this case, the walker will need to learn about 
the contents of O(n) vertices in order to find the target 
and hence the average time to find the target is given by 



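$$\tau\left[\frac{\langle k^2\rangle}{\langle k\rangle} - 1\right] = An, \tag{37}$$

for some constant $A$, or equivalently,

$$\tau = A\,\frac{n}{\langle k^2\rangle/\langle k\rangle - 1}. \tag{38}$$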
Consider, for instance, a network with a power-law degree
distribution of the form $p_k = Ck^{-\gamma}$, where $\gamma$ is a
positive exponent and $C$ is a normalizing constant chosen
such that $\sum_k p_k = 1$. Real-world networks usually
exhibit power-law behavior only over a certain range of
degree. Taking the minimum of this range to be $k = 1$
and denoting the maximum by $k_{\max}$, and approximating
the sums over $k$ by integrals, we have

$$\frac{\langle k^2\rangle}{\langle k\rangle} \simeq \frac{2-\gamma}{3-\gamma}\,\frac{k_{\max}^{3-\gamma} - 1}{k_{\max}^{2-\gamma} - 1}. \tag{39}$$

Typical values of the exponent $\gamma$ fall in the range $2 < \gamma < 3$,
so that $k_{\max}^{2-\gamma}$ is small for large $k_{\max}$ and can be
ignored. On the other hand, $k_{\max}^{3-\gamma}$ becomes large in the
same limit and hence $\langle k^2\rangle/\langle k\rangle \sim k_{\max}^{3-\gamma}$ and

$$\tau \sim n\,k_{\max}^{\gamma-3}. \tag{40}$$

The scaling of the search time with system size $n$ thus
depends, in this case, on the scaling of the maximum
degree $k_{\max}$.

As an example, Aiello et al. [16] studied power-law
degree distributions with a cut-off of the form $k_{\max} \sim n^{1/\gamma}$,
which gives

$$\tau \sim n^{2-3/\gamma}. \tag{41}$$



A similar result was obtained previously by 
Adamic et al. [6] using different methods. 



D. Bandwidth 

Bandwidth is the mean number of queries reaching a
given vertex per unit time. Equation (35) tells us that
the probability that a particular current query reaches
vertex $i$ at a particular time is $k_i/2m$, and assuming as
discussed above that the number of queries initiated per
unit time is proportional to the total number of vertices,
the bandwidth for vertex $i$ is

$$Bn\,\frac{k_i}{2m} = B\,\frac{k_i}{\langle k\rangle}, \tag{42}$$



where B is another constant. 

This implies that high-degree vertices will be over- 
loaded by comparison with low-degree ones so that, de- 
spite their good performance in terms of search times, 
networks with power-law or other highly right-skewed de- 
gree distributions may be undesirable in terms of band- 
width, with bottlenecks forming around the vertices of 
highest degree that could harm the performance of the 
entire network. If we wish to distribute load more evenly 
among the computers in our network, a network with a 
tightly peaked degree distribution is desirable. 
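The load imbalance implied by Eq. (42) can be quantified for any degree sequence. The sketch below is illustrative: the function names `bandwidth_per_vertex` and `load_imbalance` are assumptions, and the constant $B$ is exposed as a parameter with an arbitrary default of 1.

```python
def bandwidth_per_vertex(degrees, B=1.0):
    """Eq. (42): the load on vertex i is B * k_i / <k>."""
    mean = sum(degrees) / len(degrees)
    return [B * k / mean for k in degrees]

def load_imbalance(degrees):
    """Ratio of the heaviest vertex load to the average load, k_max / <k>."""
    loads = bandwidth_per_vertex(degrees)
    return max(loads) / (sum(loads) / len(loads))

# One hub of degree 16 among four degree-1 vertices carries four times the
# average load, while a regular network is perfectly balanced.
print(load_imbalance([1, 1, 1, 1, 16]), load_imbalance([3, 3, 3]))
```

The ratio depends only on $k_{\max}/\langle k\rangle$, so it stays close to 1 for tightly peaked distributions and grows with the cut-off for right-skewed ones.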



E. Choice of network 

A simple and attractive choice for our network is the
Poisson distributed network of Section III. For a Poisson
degree distribution with mean $\mu$ we have $\langle k\rangle = \mu$ and
$\langle k^2\rangle = \mu^2 + \mu$. Then, using Eq. (38), the average search
time is

$$\tau = A\,\frac{n}{\mu}. \tag{43}$$



As we have seen, a network of this type can be realized in 
practice with a biased-random-walker attachment mechanism
of the kind described in Section III.

Now if we allow $\mu$ to grow as some power of the size
of the entire network, $\mu \sim n^{\alpha}$ with $0 \le \alpha \le 1$, then $\tau \sim
n^{1-\alpha}$. For smaller values of $\alpha$, searches will take longer,
but vertices' degrees are lower on average, meaning that
each vertex will have to devote less memory
to maintaining its directory. Conversely, for larger $\alpha$,
searches will be completed more quickly at the expense
of greater memory usage. In the limiting case $\alpha = 1$,
searches are completed in constant time, independent of
the network size, despite the simple-minded nature of the
random walk search algorithm.
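
The trade-off between search time and directory size can be tabulated directly. The sketch below assumes Eq. (43) in the form $\tau = A n/\mu$ with $\mu = n^{\alpha}$; the function names, the choice A = 1, and the network size $n = 10^6$ are illustrative assumptions.

```python
def search_time(n, alpha, A=1.0):
    """Search time tau = A * n / mu with mean degree mu = n**alpha."""
    mu = n ** alpha
    return A * n / mu


def directory_size(n, alpha):
    """Typical number of neighbors whose items a vertex must index (~ mu)."""
    return n ** alpha


n = 10**6
for alpha in (0.0, 0.5, 1.0):
    # Larger alpha: shorter searches, larger directories.
    print(alpha, search_time(n, alpha), directory_size(n, alpha))
```

The product of the two columns stays proportional to n for every $\alpha$, which is one way of seeing that memory is being traded for search time.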

The price we pay for this good performance is that the 
network becomes dense, having a number of edges scaling 
as $n^{1+\alpha}$. It is important to bear in mind, however, that
this is a virtual network, in which the edges are a purely 
notional construct whose creation and maintenance car- 
ries essentially zero cost. There is a cost associated with 
the directories maintained by vertices, which for $\alpha = 1$
will contain information on the items held by a fixed frac- 
tion of all the vertices in the network. For instance, each 
vertex might be required to maintain a directory of 1% of 
all items in the network. Because of the nature of mod- 
ern computer technology, however, we don't expect this 
to create a significant problem. User time (for perform- 
ing searches) and CPU time and bandwidth are scarce 
resources that must be carefully conserved, but mem- 
ory space on hard disks is cheap, and the tens or even 
hundreds of megabytes needed to maintain a directory is 
considered in most cases to be a small investment. By 
making the choice $\alpha = 1$ we can trade cheap memory resources
for essentially optimal behavior in terms of search
time and this is normally a good deal for the user. 

We note also that the search process is naturally paral- 
lelizable: there is nothing to stop the vertex originating 
a search from sending out several independent random 
walkers and the expected time to complete the search 
will be reduced by a factor of the number of walkers. Al- 
ternatively, we could reduce the degrees of all vertices in 
the network by a constant factor and increase the num- 
ber of walkers by the same factor, which would keep the 
average search time constant while reducing the sizes of 
the directories substantially, at the cost of increasing the 
average bandwidth load on each vertex. 

As a test of our proposed search scheme, we have
performed simulations of the procedure on Poisson networks
generated using the random-walker method of Section III.
Figure 2 shows as a function of network size the
average time $\tau$ taken by a random walker to find an item
placed at a single randomly chosen vertex in the network.
As we can see, the value of $\tau$ does indeed tend to
a constant (about 100 steps in this case) as network size
becomes large.

FIG. 2: The time $\tau$ for the random walk search to find an item
deposited at a random vertex, as a function of the number of
vertices n.
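
A simulation of this general kind can be sketched in a few lines. The code below is not the authors' simulation: as a loudly stated assumption, it builds the graph by linking each pair of vertices independently with a fixed probability (which also yields an approximately Poisson degree distribution) rather than by the random-walker growth method, and the names `poisson_like_graph` and `random_walk_search`, along with all parameter values, are hypothetical.

```python
import random


def poisson_like_graph(n, mu, rng):
    """Random graph with mean degree ~ mu: each pair is linked with prob mu/(n-1)."""
    adj = {v: set() for v in range(n)}
    p = mu / (n - 1)
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def random_walk_search(adj, start, target, rng, max_steps=100_000):
    """Steps taken until the walker stands on, or next to, the target vertex.

    Each vertex is assumed to hold a directory of its neighbors' items, so the
    search succeeds as soon as the target is the current vertex or a neighbor.
    Returns None if the walker gets stuck or max_steps is exceeded.
    """
    current = start
    for step in range(max_steps + 1):
        if current == target or target in adj[current]:
            return step
        if not adj[current]:
            return None  # isolated vertex: the walker cannot move
        current = rng.choice(sorted(adj[current]))
    return None


rng = random.Random(0)
graph = poisson_like_graph(200, 10, rng)
times = [random_walk_search(graph, rng.randrange(200), rng.randrange(200), rng)
         for _ in range(100)]
found = [t for t in times if t is not None]
print(sum(found) / len(found))  # mean search time over successful searches
```

Repeating the last few lines for increasing n, with the mean degree scaled in proportion to n, is one way to look for the plateau in search time described above.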

We should also point out that for small values of $\mu$,
vertices with degree zero could cause a problem. A vertex
that loses all of its edges because its neighbors have all
left the network can no longer be reached by our random
walkers, and hence no vertices can attach to it and our
attachment scheme breaks down. However, in the case
considered here, where $\mu$ becomes large, the number of
such vertices is exponentially small, and hence they can 
be neglected without substantial deleterious effects. Any 
vertex that does find itself entirely disconnected from the 
network can simply rejoin by the standard mechanism. 
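
For a sense of scale, a Poisson degree distribution with mean $\mu$ assigns probability $p_0 = e^{-\mu}$ to degree zero, so the expected number of isolated vertices is $n e^{-\mu}$. A worked instance, with $n = 10^6$ and $\mu = 20$ chosen purely for illustration:

```latex
p_0 = \frac{\mu^{0} e^{-\mu}}{0!} = e^{-\mu},
\qquad
n\,p_0 = 10^{6}\, e^{-20} \approx 2 \times 10^{-3}.
```

Even at this modest mean degree, fewer than one isolated vertex is expected in a network of a million vertices.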



F. Item frequency distribution 

In most cases, the search problem posed above is not
a realistic representation of typical search problems encountered
in peer-to-peer networks. In real networks,
copies of items often occur in many places in the network.
Let s be the number of times a particular item
occurs in the network and let $p_s$ be the probability distribution
of s over the network, i.e., $p_s$ is the fraction of
items that exist in s copies.

If the item we are searching for exists in s copies, then
Eq. (43) becomes

$$\tau_s = A \frac{n}{\mu s}, \tag{44}$$

since the chance of finding a copy of the desired item is
multiplied by s on each step of the random walk. On the
other hand, it is likely that the frequency of searches for
items is not uniformly distributed: more popular items,
that is those with higher s, are likely to be searched for
more often than less popular ones. For the purposes of
illustration, let us make the simple assumption that the
frequency of searches for a particular item is proportional
to the item's popularity. Then the average time taken by
a search is

$$\langle\tau\rangle = \frac{\sum_{s=1}^{\infty} s\,p_s\,\tau_s}{\sum_{s=1}^{\infty} s\,p_s} = A \frac{n}{\mu\langle s\rangle}, \tag{45}$$

where we have made use of $\sum_s p_s = 1$ and $\sum_s s\,p_s = \langle s\rangle$.

One possibility is that the total number of copies of
items in the network increases in proportion to the number
of vertices, but that the number of distinct items
remains roughly the same, so that the average number of
copies of a particular item increases as $\langle s\rangle \sim n$. In this
case, $\langle\tau\rangle$ becomes independent of n even when $\mu$ is constant,
since we have to search only a constant number of
vertices, not a constant fraction, to find a desired item.
Perhaps a more realistic possibility is that the number
of distinct items increases with network size, but does so
more slowly than n, in which case one can achieve constant
search times with a mean degree $\mu$ that also increases
more slowly than n, so that directory sizes measured as a fraction
of the network size dwindle.
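
The popularity-weighted average can be checked numerically. The sketch below assumes the per-item search time $A n/(\mu s)$ and a search frequency proportional to s, as in the text; the function names and the two-valued distribution `p` are hypothetical, chosen only so that the result is easy to verify by hand.

```python
def search_time_for_item(n, mu, s, A=1.0):
    """Search time for an item present in s copies: A * n / (mu * s)."""
    return A * n / (mu * s)


def mean_search_time(n, mu, p, A=1.0):
    """Average search time when items are sought in proportion to s.

    p maps each copy number s to the fraction p_s of items with s copies.
    """
    weight = sum(s * ps for s, ps in p.items())  # equals <s>
    weighted = sum(s * ps * search_time_for_item(n, mu, s, A)
                   for s, ps in p.items())
    return weighted / weight


# Hypothetical distribution: half the items exist once, half exist three times.
p = {1: 0.5, 3: 0.5}
print(mean_search_time(n=100, mu=10, p=p))  # compare with A*n/(mu*<s>), <s> = 2

# If every item has a number of copies proportional to n, the mean search
# time no longer depends on n, even at fixed mu.
print(mean_search_time(n=1_000, mu=10, p={10: 1.0}))
print(mean_search_time(n=100_000, mu=10, p={1_000: 1.0}))
```

The last two calls illustrate the first scenario above, in which $\langle s\rangle$ grows in proportion to n while $\mu$ stays fixed.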

An alternative scenario is one of items with a power-law
frequency distribution $p_s \sim s^{-\delta}$. This case describes,
for example, most forms of mass art or culture including
books and recordings, emails and other messages circulating
on the Internet, and many others [17]. The mean
time to perform a search in the network then depends
on the value of the exponent $\delta$. In many cases we have
$\delta > 2$, which means that $\langle s\rangle$ is finite and well-behaved
as the database becomes large, and hence $\langle\tau\rangle$, Eq. (45),
differs from Eq. (43) by only a constant factor. (That
factor may be quite large, making a significant practical
difference to the waiting time for searches to complete,
but the scaling with system size is unchanged.) If $\delta < 2$,
however, then $\langle s\rangle$ becomes ill-defined, having a formally
divergent value, so that $\langle\tau\rangle \to 0$ as system size becomes
large. Physically, this represents the case in which most
searches are for the most commonly occurring items, and
those items occur so commonly that most searches terminate
very quickly.
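
The different behavior on either side of $\delta = 2$ can be seen numerically. The sketch below computes $\langle s\rangle$ for a power law truncated at a maximum copy number `s_max`; the truncation is an assumption standing in for the finite size of the database, and the function name and the exponents 3.0 and 1.5 are illustrative.

```python
def mean_copies(delta, s_max):
    """<s> for p_s proportional to s**(-delta), truncated at s_max copies."""
    norm = sum(s ** -delta for s in range(1, s_max + 1))
    first_moment = sum(s ** (1 - delta) for s in range(1, s_max + 1))
    return first_moment / norm


for s_max in (10**2, 10**3, 10**4):
    # delta = 3.0 settles to a constant; delta = 1.5 keeps growing with s_max.
    print(s_max, mean_copies(3.0, s_max), mean_copies(1.5, s_max))
```

Since the mean search time is inversely proportional to $\langle s\rangle$, a mean that keeps growing with the cut-off corresponds to a mean search time that shrinks toward zero.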

While this extra speed is a desirable feature of the 
search process, it's worth noting that average search time 
may not be the most important metric of performance 
for users of the network. In many situations, worst-case 
search time is a better measure of the ability of the search 
algorithm to meet users' demands. Assuming that the 
most infrequently occurring items in the network occur 
only once, or only a fixed number of times, the worst-case 
performance will still be given by Eq. (43).



G. Estimating network size 

One further detail remains to be considered. If we want
to make the mean degree $\mu$ of vertices added to the network
proportional to the size n of the entire network, or
to some power of n, we need to know n, which presents
a challenge since, as we have said, we do not expect any
vertex to know the identity of all or even most of the other
vertices. This problem can be solved using a breadth-first
search, which can be implemented once again by
message passing across the network. One vertex i chosen
at random (or more realistically every vertex, at random
but stochastically constant intervals proportional to system
size) sends messages to some number d of randomly
chosen neighbors. The message contains the address of
vertex i, a unique identifier string, and a counter whose
initial value is zero. Each receiving vertex increases the
counter by 1, passes the message on to one of its neighbors,
and also sends messages with the same address and
identifier, and with counter zero, to $d - 1$ other neighbors.
Any vertex receiving a message with an identifier
it has seen previously sends the value of the counter contained
in that message back to vertex i, but does not
forward the message to any further vertices. If vertex i
adds together all of the counter values it receives, the
total will equal the number of vertices (other than itself)
in the entire network. This number can then be broadcast
to every other vertex in the network using a similar
breadth-first search (or perhaps as a part of the next such
search instigated by vertex i).
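
The counting procedure just described can be simulated on a small graph. The following is a minimal sketch under stated assumptions: messages are processed one at a time from a queue (standing in for asynchronous delivery), a vertex with fewer than d neighbors simply messages all of them, and the function name `count_vertices` and the ring-shaped test graph are hypothetical.

```python
import random
from collections import deque


def count_vertices(adj, origin, d, rng):
    """Simulate the counting messages; return (sum of counters, messages sent).

    adj maps each vertex to a list of its neighbors.
    """
    seen = {origin}            # vertices that have already seen the identifier
    queue = deque()            # pending messages as (recipient, counter)
    total = 0                  # sum of counter values reported back to origin
    sent = 0                   # forwarded messages (reports to origin excluded)

    for nb in rng.sample(adj[origin], min(d, len(adj[origin]))):
        queue.append((nb, 0))
        sent += 1

    while queue:
        vertex, counter = queue.popleft()
        if vertex in seen:
            total += counter   # report the counter to the origin, stop forwarding
            continue
        seen.add(vertex)
        targets = rng.sample(adj[vertex], min(d, len(adj[vertex])))
        for j, nb in enumerate(targets):
            # One copy carries the incremented counter; the rest start at zero.
            queue.append((nb, counter + 1 if j == 0 else 0))
            sent += 1
    return total, sent


ring = {v: [(v - 1) % 10, (v + 1) % 10] for v in range(10)}
print(count_vertices(ring, origin=0, d=2, rng=random.Random(1)))  # prints (9, 20)
```

On the 10-vertex ring with d = 2 the reported counters sum to 9, the number of vertices other than the origin, and 20 messages are forwarded, which equals dn. If d is smaller than some vertex degrees, the messages may fail to reach every vertex, in which case the sum counts only the vertices actually reached.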

The advantage of this process is that it has a total
bandwidth cost (total number of messages sent) equal
to dn. For constant d, therefore, the cost per vertex is
a constant, and hence the process will scale to arbitrarily
large networks without increasing the bandwidth required
of each vertex. The
(worst-case) time taken by the process depends on the
longest geodesic path between any two vertices in the network,
which is O(log n). Although not as good as O(1),
this still allows the network to scale to exponentially large
sizes before the time needed to measure network size becomes
an issue, and it seems likely that directory size
(which scales linearly with or as a power of n depending
on the precise algorithm) will become a limiting factor
long before this happens.

V. CONCLUSIONS 

In this paper, we have considered the problem of de- 
signing networks indirectly by manipulating the rules by 
which they evolve. For certain types of networks, such as 
peer-to-peer networks, the limited control that this ma- 
nipulation gives us over network structure, such as the 
ability to impose an arbitrary degree distribution of our 
choosing on the network, may be sufficient to generate 
significant improvements in network performance. Using 
generating function methods, we have shown that it is 
possible to impose a (nearly) arbitrary degree distribution
on a network by appropriate choice of the "attachment
kernel" that governs how newly added vertices connect
to the network. Furthermore, we have described a
scheme based on biased random walks whereby arbitrary 
attachment kernels can be implemented in practice. 

We have also considered what particular choices of de- 
gree distribution offer the best performance in idealized 
networks under simple assumptions about search strate- 
gies and bandwidth constraints. We have given general 
formulas for search times and bandwidth usage per ver- 
tex and studied in detail one particularly simple case of a 
Poisson network that can be realized in straightforward 
fashion using our biased random walker scheme, allows 
us to perform decentralized searches in constant time, 
and makes only constant bandwidth demands per vertex, 
even in the limit where the database becomes arbitrar- 
ily large. No part of the scheme requires any centralized 
knowledge of the network, making the network a true 
peer-to-peer network, in the sense of having client nodes 
only and no servers. 

One important issue that we have neglected in our 
discussion is that of "supernodes" in the network. Be- 
cause the speed of previous search strategies has been 
recognized as a serious problem for peer-to-peer net- 
works, designers of some networks have chosen to des- 
ignate a subset of network vertices (typically those with 
above-average bandwidth and CPU resources) as super- 
nodes. These supernodes are themselves connected to- 
gether into a network over which all search activity takes 
place. Other client vertices then query this network when 
they want a search performed. Since the size of the su- 
pernode network is considerably less than the size of the 
network as a whole, this tactic increases the speed of 
searches, albeit only by a constant factor, at the expense 
of heavier load on the supernode machines. It would 
be elementary to generalize our approach to incorporate 
supernodes. One would simply give each supernode a di- 
rectory of the data items stored by the client vertices of 
its supernode neighbors. Then searches would take place 
exactly as before, but on the supernode network alone, 
and client vertices would query the supernode network to 
perform searches. In all other respects the mechanisms 
would remain the same. 